Biostatistics For Dummies (Monika Wahi John Pezzullo)

eventual failure. If you record the values of your variables the wrong way in your data, it may take an enormous amount of additional effort to go back and fix them, and depending on the error, a fix may not even be possible!

Dealing with free-text data

It’s best to limit free-text variables, such as participant comments or write-in fields for Other choices in a questionnaire, because they are difficult to box into one of the four levels of measurement. You should collect free-text variables only when you need to record verbatim what someone said or wrote. Don’t use free-text fields as a lazy person’s substitute for what should be precisely defined categorical data. Doing any meaningful statistical analysis of free-text fields is generally very difficult, if not impossible.
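To see why write-in answers resist analysis, consider the following minimal Python sketch. The question, the answers, and the coding scheme (`CODES`) are all hypothetical and made up for illustration; they don't come from any real study.

```python
from collections import Counter

# Hypothetical write-in answers to "How do you usually get to the clinic?"
free_text = ["Bus", "bus ", "by bus", "Car", "my car", "walk", "Walking", "BUS"]

# The same eight answers recorded with a hypothetical predefined coding scheme
CODES = {1: "Car", 2: "Bus", 3: "Walk", 9: "Other"}
coded = [2, 2, 2, 1, 1, 3, 3, 2]

# Tallying the raw text treats every spelling variant as its own category
free_text_counts = Counter(free_text)

# Tallying the coded values gives one count per defined category
coded_counts = Counter(CODES[c] for c in coded)

print(len(free_text_counts))  # 8 distinct strings for only 3 real categories
print(dict(coded_counts))     # {'Bus': 4, 'Car': 2, 'Walk': 2}
```

The free-text version needs manual cleanup before you can even count the answers, whereas the coded version is ready to tabulate as soon as it's entered.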

You should also be aware that most software has field-length limitations for text fields. Although commonly used programs for statistical work, like Microsoft Excel, SPSS, SAS, R, and Python, may allow for long data fields, this does not excuse you from designing your study to limit the collection of free-text variables. Flip to Chapter 4 for an introduction to statistical software.

Assigning participant study identification (ID) numbers

Every participant in your study should have a unique participant study identifier (typically called a study ID). The study ID is present in the participant’s data and is used for identifying the participant on study materials (for example, laboratory specimens sent for analysis). In some studies, you may need to combine two variables to create a unique identifier. In a single-site study (one carried out at only one geographical location), the study ID can be a whole number that is two to four digits long. It doesn’t have to start at 1; it can start at 100 if you want all the ID numbers to be three digits long without leading zeros. In multi-site studies, which are carried out at several locations (such as different clinics or labs), the study ID often follows a structured scheme. For example, it could have two parts, such as a site number and a local study ID number separated by a hyphen (for example, 03-104). This is the situation in which you need two variables to get a unique ID, because the same local number can be reused at different sites.
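The two-part ID described above can be sketched in Python as follows. The function names `make_study_id` and `split_study_id` are hypothetical, and the field widths (a two-digit site number and a three-digit local ID starting at 100) are assumptions chosen to match the 03-104 example rather than a required standard.

```python
def make_study_id(site: int, local_id: int) -> str:
    """Combine a site number and a local ID into one study ID such as 03-104."""
    # Assumed widths: two-digit site number, three-digit local ID starting at 100
    if not 0 <= site <= 99:
        raise ValueError("site number must fit in two digits")
    if not 100 <= local_id <= 999:
        raise ValueError("local ID must be a three-digit number")
    return f"{site:02d}-{local_id:03d}"


def split_study_id(study_id: str) -> tuple[int, int]:
    """Recover the site number and local ID from a combined study ID."""
    site, local_id = study_id.split("-")
    return int(site), int(local_id)


print(make_study_id(3, 104))     # 03-104
print(split_study_id("03-104"))  # (3, 104)

# The same local ID at two different sites still yields two distinct study IDs
print(make_study_id(3, 104) != make_study_id(4, 104))  # True
```

Storing the combined ID as text rather than as a number keeps the leading zero in the site portion intact.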

Organizing name and address data in the study ID crosswalk

A research database should not include private identifying information for the participant, such as the participant’s full name and home address. Yet these data need to be accessible to study staff to facilitate the research. Private data like these are typically stored in a spreadsheet called a study ID crosswalk. This spreadsheet keeps a link (or crosswalk) between each participant’s study ID and the private data that must not be stored in the research database. When you store names in the study ID crosswalk, choose one of the following formats so that you can easily sort participants into alphabetical order or use the spreadsheet to facilitate study mailings:

A single variable: Last, First Middle (like Smith, John A)

Two columns: One for Last, another for First and Middle
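As a rough illustration of how a crosswalk and these name formats work together, here is a minimal Python sketch. All the rows are hypothetical (only the Smith, John A format comes from the text above), and the column names `study_id`, `last`, and `first_middle` are assumptions, not a prescribed layout.

```python
# Hypothetical crosswalk rows, stored apart from the research database
crosswalk = [
    {"study_id": "03-104", "last": "Smith", "first_middle": "John A"},
    {"study_id": "03-101", "last": "Garcia", "first_middle": "Maria L"},
    {"study_id": "01-117", "last": "Chen", "first_middle": "Wei"},
]

# Two-column format: sort by last name, then by first and middle name
by_name = sorted(
    crosswalk,
    key=lambda row: (row["last"].lower(), row["first_middle"].lower()),
)

# Single-variable format: "Last, First Middle"
single_variable = [f'{row["last"]}, {row["first_middle"]}' for row in by_name]
print(single_variable)  # ['Chen, Wei', 'Garcia, Maria L', 'Smith, John A']

# Study staff can look up a participant's private data by study ID
lookup = {row["study_id"]: row for row in crosswalk}
print(lookup["03-104"]["last"])  # Smith
```

Either format sorts alphabetically; the two-column layout just makes it easier to pull out the last name on its own for mailings.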